Abstract

The Starlink Hierarchical Data System has been a very successful niche astronomy file format and library for over 30 years. 
Development of the library was frozen ten years ago when funding for Starlink was stopped and almost no-one remains who 
understands the implementation details. To ensure the long-term sustainability of the Starlink application software and to make 
the extensible N-Dimensional Data Format accessible to a broader range of users, we propose to re-implement the HDS library 
application interface as a layer on top of the Hierarchical Data Format version 5. We present an overview of the new implementation 
of version 5 of the HDS file format and describe differences between the expectations of the HDS and HDF5 library interfaces. We 
finish by comparing the old and new HDS implementations in terms of file sizes and performance benchmarks.
1. Introduction

The Hierarchical Data System (HDS) was created by the
Starlink Project in the United Kingdom in the early 1980s (Disney
and Wallace, 1982; Lawden, 1991). The requirements were
to have a file format that was optimized for data processing
applications: allowing for efficient access of data arrays through
memory mapping, grouping of related data structures in a
hierarchy to make it easy to move data en masse from one location
to another, and easy modification of data and structures. At the
time FITS (Wells et al., 1981) was mainly thought of as a
transport format distributed on tape (Greisen et al., 1980), and the
NCSA Hierarchical Data Format would not be developed
until the end of the decade (Folk, 2010; Krauskopf and Paulsen,
1988). It was therefore decided to develop a new file format
from scratch. Initially called the Starlink Data System before
being rebranded as HDS, the first version of the library was
written in BLISS-32 before being ported to C on VAX/VMS in
the late 1980s (Lupton, 1989).

HDS succeeded in its goal of forming the basis of the Starlink
data reduction software packages (ascl:1110.012) and has
been, and continues to be, used at UK observatories for data
acquisition and for data reduction pipelines (Bell et al., 2014;
Jenness and Economou, 2011; Jenness and Economou, 2015). Its presence
was pervasive within the Starlink software stack, being used
for parameter storage in the ADAM system (Allan, 1992) and
in a graphics database system (Eaton and McIlwrath, 2014) in
addition to storing astronomy data and forming the basis of
the Starlink N-Dimensional Data Format library (NDF; Jenness
et al., 2015, ascl:1411.023). There was very little take up of the
format outside the UK community (but see e.g., Meyerdierks,
1993) and as FITS came to be used as a data processing format
as well as a transport and archive format, HDS has become a
niche product.

^Corresponding author

Email address: tjenness@cornell.edu (Tim Jenness)

HDF5 is a popular file format in other scientific disciplines
and is used in fields such as Earth science (e.g., Yang et al.,
2005), biology (e.g., Dougherty et al., 2009), nuclear physics
(e.g., Pedersen et al., 2013), and molecular simulations (e.g.,
de Buyl et al., 2014). The astronomy community is currently
discussing the wider issues of file formats beyond FITS (Mink
et al., 2015; Thomas et al., 2015) and HDF5 is being adopted
(e.g., Alexov et al., 2012) or investigated (e.g., Price et al.,
2015; Schaaf et al., 2015) in a number of astronomy projects.

Given this context it is therefore worth investigating whether
there should be a new version of HDS that is based on HDF5. In
this paper we compare the HDS and HDF5 data models, discuss
the motivations for such a change, and describe an
implementation.

2. Motivation 

There are a number of key motivators for migrating from the 
current HDS format to a more widely-recognized format: 

1. Opaque implementation details of the library and format 
with no resident expert or associated documentation. 

2. Lack of support for 64-bit dimension sizes.

3. HDS has no provision for transparent data compression. 

4. HDS has no native support for tables. 

5. Sociological impediment to adopting a niche format in the 
wider astronomy community. 

We will discuss each of these in turn. 
2.1. Opaque implementation

The Starlink Project was closed in 2005 after 25 years of
operation. The Starlink Software Collection continues to be
developed as an open-source project sponsored by the Joint
Astronomy Centre to continue support for their data reduction
pipelines. Unfortunately none of the remaining developers
understand the implementation details and there is no incentive for
anyone to learn given other development priorities. The most
recent description of the internals of the HDS file format is in
Lupton (1989), which documents version 2 of the format.
Version 3 (from the port to Unix in 1991) and version 4 (in 2005 to
support files larger than 2 GB) remain undocumented, outside of
extensive comments in the code itself, with an assumption that
the design closely matches the original layout. For a data
format that is used as an archive format (see e.g., Economou et al.,
2015) this lack of documentation and understanding represents
a risk for long-term access to the data.

2.2. 64-bit dimension sizes 

The HDS library does not currently support 64-bit dimension
sizes. This was partially implemented for version 4 of the
format but was never completed. It is not clear how much effort
would be needed to finish this work or whether anyone remaining can
do so. As data rates increase it is clear that the next generation
of heterodyne arrays (e.g., Jenness et al., 2014) will generate
data cubes that exceed the capacity of a 32-bit integer counter
and these files will render the Starlink software and associated
data pipelines unusable without HDS being upgraded. Adding
64-bit support to HDS is a necessary but not sufficient step
towards the applications supporting larger datasets, and most
libraries and applications in Starlink will need to be updated to
support 64-bit counters. HDS is the fundamental building block
that has to be converted first, providing new APIs to allow both
32-bit and 64-bit application code to co-exist.
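As a rough illustration of the limit, the element count of a data cube can be compared with the largest value a signed 32-bit counter can hold. This is a minimal sketch in Python rather than the library's C; the function names and the cube dimensions are invented for illustration and do not describe any particular instrument.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit counter can hold


def element_count(dims):
    """Return the total number of elements for the given dimension sizes."""
    count = 1
    for size in dims:
        count *= size
    return count


def fits_in_int32(dims):
    """True if each dimension and the total element count fit in 32 bits."""
    return all(size <= INT32_MAX for size in dims) and element_count(dims) <= INT32_MAX


# Hypothetical cube sizes, chosen only to sit either side of the limit.
small_cube = (1024, 1024, 1024)   # 2**30 elements
large_cube = (16384, 16384, 8)    # 2**31 elements, one more than INT32_MAX
```

Note that no single dimension of the larger cube is close to the limit; it is the product of the dimensions that overflows the counter.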

2.3. Data Compression 

HDF5 supports transparent data compression of datasets
using a number of algorithms in addition to supporting a
pluggable architecture. HDS does not support any data compression
and relies on the facilities of the NDF library to provide these.
NDF can support gzipped data files, requiring a temporary file,
as well as FITS-style BSCALE/BZERO and a delta compression
scheme natively. Leveraging the HDF5 compression algorithms
would be a very easy way to improve the data compression
performance in NDF.

2.4. Native Tables 

Unlike FITS (Harten et al., 1988) or HDF5, HDS does not
have a native table data structure. Tables must be implemented
as a collection of independent columns and this can be
extremely inefficient for row access. Switching data formats
would make it possible to add a native table access API to HDS.


2.5. Sociology 

One of the impediments to adoption of the NDF data model
is that the reference library implementation uses HDS as the
underlying file format. Whilst the NDF data model itself does not
require HDS and could be implemented by anyone with
sufficient effort available, to adopt NDF currently requires that users
adopt HDS. Adopting HDF5 would considerably lower the
barrier to entry for people more comfortable in the HDF5 world or
who are considering switching from another format, and make
data files accessible to tools such as HDFView and h5py. Of
course, none of these tools will understand the NDF data model
that defines the hierarchical grouping but it makes it easier for
other tools to adopt some of the same conventions. Similarly,
the Starlink file format conversion tools (Currie et al., 1996)
would be able to import formats such as FITS to HDF5 and
this infrastructure may prove to be useful for people who are
themselves switching to HDF5.

3. Features of HDS 

How HDS is used depends on a data model and the data
access model. Both of these are important when considering a
change in implementation.

3.1. Data Model 

HDS is a hierarchical file format where named structures can 
contain other named structures or named primitive data arrays. 
It is self-describing in the sense that each layer in the hierarchy 
can be queried to obtain the number of members below (for 
structures), their names and their types. 

The primitive objects support numerical and string types with
up to 7 dimensions. The supported data types are shown in
Table 1 and the choice and names reflect the origins of the format
as a library designed to be used from Fortran.

All HDS components are typed and this includes structures. 
Structure typing is important in HDS as it can be used by an 
application to decide whether a structure can be understood or 
not. In object-oriented nomenclature the type can be thought of 
as the object class, whereas the individual named structure is an 
object instance. 

HDS supports the concept of arrays of structures to allow a 
collection of identical structures to be grouped together. This 
is used extensively in the lower-levels of the Starlink software 
stack, for example to support history recording (an array of 
structures of type HISTORY which is extended each time a new 
history record is created), or picture definitions in the graphics 
database. 

3.2. Data Access Model 

To obtain access to an object within an HDS file the caller
must obtain a locator, an opaque C struct containing
information about the object needed by the HDS library. These
locators mediate all access to the HDS data file.

Component copying. As a consequence of the hierarchical
design, it is possible to copy or move arbitrary parts of the tree to
other locations within a file or locations in different files.


Table 1: HDS basic data types. The unsigned types did not correspond to standard Fortran 77 data types and were included for compatibility with astronomy
instrumentation. HDS supports both VAX and IEEE floating-point formats. The API code indicates the letter appended to function names to indicate the type they
support. This convention is used for the generic templating system (Beard et al., 2006). For compatibility with HDS the _LOGICAL type is a 32-bit bitfield type
in memory but stored in an HDF5 file using 8 bits. HDS strings are always stored in Fortran space-padded form and that convention is adopted in the HDF5 HDS
implementation.

Name of type   API Code   Data type                    HDF5 Data type
------------   --------   --------------------------   -----------------
_BYTE          b          Signed 8-bit integer         H5T_NATIVE_INT8
_UBYTE         ub         Unsigned 8-bit integer       H5T_NATIVE_UINT8
_WORD          w          Signed 16-bit integer        H5T_NATIVE_INT16
_UWORD         uw         Unsigned 16-bit integer      H5T_NATIVE_UINT16
_INTEGER       i          Signed 32-bit integer        H5T_NATIVE_INT32
_INT64         k          Signed 64-bit integer        H5T_NATIVE_INT64
_LOGICAL       l          Boolean                      H5T_NATIVE_B8
_REAL          r          32-bit float                 H5T_NATIVE_FLOAT
_DOUBLE        d          64-bit float                 H5T_NATIVE_DOUBLE
_CHAR[*n]      c          String of 8-bit characters   H5T_STRING


Primary and secondary locators. The library has a concept of
primary and secondary locators such that when all primary
locators associated with a file are freed (or annulled in HDS
parlance) all resources associated with that file are also closed and
all secondary locators become inactive.

Locator Groups. It is possible to assign locators to a named 
group. Any child locators are also members of the group. When 
the group is no longer required it can be flushed with a single 
command, freeing all the locators that are in the group. This 
simplifies the management of large numbers of related locators 
and allows the resources to be freed at one place in the code 
without having to store them all in user code. 

Slicing. Arrays of structures cannot be accessed directly but
must instead be accessed by requesting a specific cell.
Primitive data arrays can also be accessed by individual cells but it is
more common to access data arrays by specifying slices. A slice
can be requested by specifying upper and lower bounds of each
dimension. The Fortran heritage requires that these bounds are
indexed starting with a lower bound of 1 rather than 0, and all
data arrays are specified in Fortran order, even from the C
interface. In some cases the dimensionality is unimportant and
the library allows a locator to be vectorized such that
subsequent interrogations of the locator will indicate that the object
is 1-dimensional regardless of the underlying shape. This can
be very useful for such activities as examining every element in
turn, or picking the first few elements. Vectorizing works for
structures and primitives and does not affect the file itself.

Automatic type conversion. For primitive arrays, the data to be
stored or the data to be retrieved do not have to be the same type
as the format of the data stored on disk. Floating point data will
be converted to integer and vice versa. Also, string and
logical/boolean types will be converted to numbers and numbers
can be retrieved as strings or logicals. Endianness and
floating point representation are also handled transparently, and the
native form is used when a file is created.


Memory Mapping. One of the initial requirements for HDS
was efficient access to data arrays. This was done using
direct mapping of the relevant part of the file into memory and
was implemented for read and write operations. The memory
mapping facility can be enabled or disabled by use of an
environment variable and an in-memory solution is used on systems
that do not support memory mapping.

4. Requirements for an Updated Format 

The Starlink software collection consists of more than 2.3
million lines of Fortran, C and C++ and a large fraction of that
code depends on the HDS library and the HDS API. This
includes fundamental infrastructure such as ADAM that is used
by all applications. It is therefore imperative that the API for
HDS remains the same even with the implementation changing
underneath. Any new version of HDS should meet the
following requirements:

1. The API should not change. 

2. It should be possible to use both old and new format files 
in the same application. 

3. The application should behave in the same way with new 
files as it does with old files. 

4. The application source code should not need to be
modified in any way to use the new library.

5. The new format should not impact performance of the
application in a negative way or require more computer
resources.

These are similar to the requirements described when 
NetCDF version 4 was implemented on top of HDF5 (Rew and 
Hartnett, 2004). 
5. HDS Version 5


Given the broad adoption of HDF5 in the scientific
community and the close similarity in key parts of the data model
between it and HDS, it was decided to write a prototype
implementation of the HDS API in terms of HDF5. This would
provide information on the feasibility of the approach and also
highlight the areas where the data models or access models
diverge. The previous version of HDS^ was version 4 so it was
decided this version would be version 5. In the rest of this
paper we use the shorthand HDSv5 to refer to the new library
implementation and format, and HDSv4 to refer to the current
version of the HDS format and library.


5.1. Library Architecture

In order to support both new and old file formats it was
necessary for the new library to have access to a complete copy of
the existing library. The HDF5-based library and the HDSv4
library are both standalone libraries that are linked in to a
wrapper library that implements the public interface (see Fig. 1). The
versioned libraries can be configured to provide the public API
but when used as part of the unified wrapper they are built with
names that include the version number to avoid symbol clashes.

The wrapper library is responsible for forwarding the calls to
the correct underlying library. There are four major API styles
that must be handled: functions that open files and return
locators, functions that create files, functions that copy from one
locator to another, and functions that work with a single locator.

When a request is made to open a file, it is first sent to
the HDSv4 library to see if it opens. If that fails due to the
format being invalid, the HDSv5 library is used to open the file.
When migration to the new format is substantially complete the
wrapper will be modified to default to using the HDSv5 library
first. One caveat is that the library must ensure that HDSv5 files
are written to disk immediately on creation^ such that the HDF5
superblock signature is written. Without this step H5Fis_hdf5
will not correctly identify a newly created file as an HDF5
file if it has not yet been closed. This matters because some
Starlink applications and libraries rely on the ability to create
an HDS file and then open it in another part of the code without
having annulled all previous locators beforehand.
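The open-time fallback can be sketched as follows. This is a minimal Python illustration, not the wrapper's C code: `InvalidFormatError`, `open_with_fallback` and the two demo openers are hypothetical names, and the demo openers operate on the leading bytes of a file rather than on a real file. The 8-byte HDF5 signature is the one defined by the HDF5 file format; only offset 0 is checked here.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"  # HDF5 superblock signature


class InvalidFormatError(Exception):
    """Raised by an opener when the data are not in its format (hypothetical)."""


def has_hdf5_signature(header):
    """True if the leading bytes start with the HDF5 signature."""
    return header.startswith(HDF5_SIGNATURE)


def open_with_fallback(source, open_v4, open_v5):
    """Try the HDSv4 opener first; use HDSv5 only after a format error."""
    try:
        return open_v4(source)
    except InvalidFormatError:
        return open_v5(source)


def demo_open_v4(header):
    """Stand-in for the HDSv4 opener: rejects anything that looks like HDF5."""
    if has_hdf5_signature(header):
        raise InvalidFormatError("not an HDSv4 file")
    return ("v4", header)


def demo_open_v5(header):
    """Stand-in for the HDSv5 opener: requires the HDF5 signature."""
    if not has_hdf5_signature(header):
        raise InvalidFormatError("not an HDF5 file")
    return ("v5", header)
```

A file whose signature has not yet reached the disk looks like `b""` to the signature check, which is why the flush on creation is needed before the file can be reopened elsewhere.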

When files are to be created the choice of format is controlled
by a tuning parameter. Tuning parameters in HDS can be set
programmatically or by reading the environment. By default,
files are still created in HDSv4 format, following the principle of
least surprise. The ability to control this behavior from an
environment variable simplifies testing and benchmarking.

When copying one locator to another locator of a different
type, tree-walking code had to be written using the HDS
public API. The code recursively walks through structures copying
primitives and other structures as required.


^Somewhat confusingly the library implementing version 4 of the file format
is itself version 5.

^Calling H5Fflush in hdsNew.
Figure 1: Architecture of the HDF5-based implementation of HDS. A wrapper 
library with the public HDS API forwards calls to the correct version of the 
library. The Fortran interface is a separate library as it also contains Fortran 
code that would require a Fortran runtime library. 


The bulk of the API takes a single locator as input and does
something with it that may or may not result in a new locator
being created. We have taken a slightly different approach to that
described in Rew and Hartnett (2004). In that paper they
registered function table lookups with the newly created data
objects, allowing efficient forwarding to the particular library. We
decided to take a simpler approach whereby the locator
structures in HDSv4 and HDSv5 were adjusted so that they both
included a version integer as the first member. The wrapper
code then simply checks for the version number in the structure
and calls the relevant routine. This approach does simplify the
addition of debugging messages and error reporting from each
routine at the expense of some calling efficiency.
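The version-based forwarding can be illustrated with a small sketch. It is written in Python for brevity, whereas the real wrapper inspects the first integer member of a C struct; the `Locator` class and the `dat_name_v4`/`dat_name_v5` functions are hypothetical stand-ins for the versioned library routines.

```python
class Locator:
    """Stand-in for a locator whose first member records the library version."""

    def __init__(self, version, name):
        self.version = version  # 4 for HDSv4, 5 for HDSv5
        self.name = name


def dat_name_v4(locator):
    """Stand-in for the HDSv4 routine."""
    return "v4:" + locator.name


def dat_name_v5(locator):
    """Stand-in for the HDSv5 routine."""
    return "v5:" + locator.name


def dat_name(locator):
    """Public wrapper: check the version and forward to the matching library."""
    if locator.version == 5:
        return dat_name_v5(locator)
    if locator.version == 4:
        return dat_name_v4(locator)
    raise ValueError("unknown locator version: %r" % (locator.version,))
```

Because every call passes through one small function per API routine, that function is also a convenient place to add debugging or error-reporting hooks.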

The wrapper code responsible for this forwarding is
generated from the public HDS header file using a simple Python
program. This allows the forwarding scheme to be changed
relatively easily.
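To give a flavour of what such a generator might look like, the sketch below turns one C prototype into forwarding code. It is not the actual program: the regular expression only handles simple `int`-returning prototypes, the `_v4`/`_v5` name suffixes and the `LOC_VERSION` macro in the emitted C are hypothetical, and the first argument is assumed to be the locator.

```python
import re

# Matches simple prototypes of the form: int name(arg, arg, ...);
PROTOTYPE = re.compile(r"^\s*int\s+(\w+)\s*\((.*)\)\s*;\s*$")


def parse_prototype(line):
    """Return (function name, list of argument declarations) or None."""
    match = PROTOTYPE.match(line)
    if match is None:
        return None
    name, arg_text = match.groups()
    return name, [arg.strip() for arg in arg_text.split(",")]


def arg_name(declaration):
    """Extract the identifier from a declaration such as 'int *status'."""
    without_array = declaration.split("[")[0]  # drop any array suffix
    return re.findall(r"\w+", without_array)[-1]


def generate_wrapper(line):
    """Emit C source that forwards a call based on the locator version."""
    parsed = parse_prototype(line)
    if parsed is None:
        raise ValueError("unsupported prototype: " + line)
    name, args = parsed
    call_args = ", ".join(arg_name(arg) for arg in args)
    return "\n".join([
        "int %s(%s) {" % (name, ", ".join(args)),
        "  if (LOC_VERSION(%s) == 5)" % arg_name(args[0]),  # hypothetical macro
        "    return %s_v5(%s);" % (name, call_args),
        "  return %s_v4(%s);" % (name, call_args),
        "}",
    ])
```

For example, `generate_wrapper("int datName(const HDSLoc *locator, char name_str[DAT__SZNAM+1], int *status);")` yields a C function whose body forwards `(locator, name_str, status)` to `datName_v5` or `datName_v4`. Changing the forwarding scheme then only requires editing the template strings.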

5.2. The Locator Interface 

As mentioned previously, a locator is an opaque C struct
containing information about a particular object in the HDS file.
The size of these structures differs between the two
implementations and this required that a change be made to the
Fortran interface of the HDS library. For historical reasons an HDS
locator is stored in a Fortran character array with a length
specified by the HDS constant DAT__SZLOC (currently 16 characters).
In HDSv4 the C structure is really a proxy for an internal data
structure and the Fortran interface copied the contents of the
structure to and from the Fortran string. In HDSv5, the C
structure is significantly larger and it was unreasonable to increase
the Fortran locator size. To support both implementations the
Fortran interface was changed such that the structure address
was stored in the Fortran string buffer and the size of the string
was kept at 16 characters. This allowed the new library to be
installed without requiring that any applications be relinked.

5.3. Error handling 

HDS uses the Starlink Error Message Service (EMS; Rees
et al., 2008) for error handling. EMS uses the concept of
inherited status where each function takes a status argument and
usually returns immediately if status is non-zero (when resources
are to be freed it is usual for the freeing routine to try to execute
regardless). If an error condition is to be reported by a function
it sets the status to an appropriate value and attaches an error
message to the message stack. As the call stack unwinds the
error could either be annulled (the calling function may wish to
react to the error by trying an alternative) or be augmented with
more information.
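The inherited status pattern can be sketched as below. Python has no pass-by-reference integers, so this illustration bundles the status value and the message stack into a small `ErrorContext` class; that class, the `DEMO__NOTFOUND` code and the lookup functions are all hypothetical and are not part of EMS.

```python
STATUS_OK = 0
DEMO__NOTFOUND = 1  # hypothetical error code used only in this sketch


class ErrorContext:
    """Stand-in for an inherited status value plus its message stack."""

    def __init__(self):
        self.status = STATUS_OK
        self.messages = []

    def report(self, code, text):
        """Set bad status and attach a message to the stack."""
        self.status = code
        self.messages.append(text)

    def annul(self):
        """Clear the status and discard pending messages."""
        self.status = STATUS_OK
        self.messages.clear()


def find_component(ctx, components, name):
    """Look up a component, returning immediately if status is already bad."""
    if ctx.status != STATUS_OK:
        return None
    if name not in components:
        ctx.report(DEMO__NOTFOUND, "Component '%s' not found" % name)
        return None
    return components[name]


def find_with_fallback(ctx, components, preferred, alternative):
    """Annul a 'not found' error and try an alternative name instead."""
    value = find_component(ctx, components, preferred)
    if ctx.status == DEMO__NOTFOUND:
        ctx.annul()
        value = find_component(ctx, components, alternative)
    return value
```

The caller in `find_with_fallback` reacts to one specific error by annulling it and trying an alternative; any other error would be left on the stack for a higher-level caller to handle or augment.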

HDF5 uses a similar error message stack and status code
concept internally but uses a function return value to indicate to the
external user that a problem has occurred. If the return value is
negative the call failed. The error messages and specific status
code must then be retrieved separately. In HDSv5 each call to
HDF5 is wrapped by a C macro that intercepts the status
return and if necessary queries the HDF5 error message stack and
places each of the messages on to the EMS message stack.

In some cases the HDF5 status code is translated to an
equivalent HDS error code but in many cases the HDF5 codes are
not specific enough and in that case a generic "error from HDF5"
code is used.
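The effect of wrapping each HDF5 call can be sketched in the same style. This is an illustration only: the real library uses a C macro, and `FakeHdf5`, `GENERIC_HDF5_ERROR` and `call_hdf5` are invented names standing in for the HDF5 library, the generic HDS error code, and the macro respectively.

```python
STATUS_OK = 0
GENERIC_HDF5_ERROR = 1  # hypothetical stand-in for the generic error code


class ErrorContext:
    """Stand-in for the EMS status value and message stack."""

    def __init__(self):
        self.status = STATUS_OK
        self.messages = []


class FakeHdf5:
    """Invented stand-in: negative return on failure plus a private error stack."""

    def __init__(self):
        self.error_stack = []

    def open_group(self, name):
        if not name:
            self.error_stack.append("open_group: empty name")
            return -1
        return 1  # any non-negative value signals success


def call_hdf5(ctx, lib, func, *args):
    """Call func unless status is already bad; on failure move messages to ctx."""
    if ctx.status != STATUS_OK:
        return -1
    result = func(*args)
    if result < 0:
        ctx.status = GENERIC_HDF5_ERROR
        ctx.messages.extend(lib.error_stack)  # copy each message across
        lib.error_stack.clear()
    return result
```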

5.4. Data Model 

In HDF5, structures are known as groups and primitives are
known as datasets. Table 1 shows the mapping of HDS data
types to the HDF5 equivalents and the type system is
significantly more advanced in HDF5. It was decided that boolean
types should be represented in the files as 8-bit bitfields rather
than the 32-bit integer type that is used (part of the Fortran
legacy). The in-memory datatype is a 32-bit integer for
consistency with the public API but the smaller type is used on disk.
A bitfield type is used as this allows the HDS type query to be
able to distinguish the _BYTE type from the _LOGICAL type
without requiring the use of HDF5 attributes. Strings in HDS are
space-padded fixed size following the Fortran style and this is
how strings are stored in HDSv5. Datasets are stored in HDF5
files in C dimension order with the dimensions being reversed
when viewed from HDS. This is the same approach taken by
the HDF5 Fortran interface with the variation that the HDS C
view of an array must agree with Fortran.

HDF5 has no concept of arrays of structures so this facility
is implemented entirely by the HDSv5 library. The
containing group is created and within it are placed the number of
groups corresponding to the array size. Each of these groups
is given a name that contains a root string chosen to
deliberately be longer than the maximum allowed length of an HDS
component, appended with the coordinates of the structure in
the array. For a 2-dimensional array of structures the name
of the group could be ARRAY_OF_STRUCTURES_CELL(2,3) for
the group at coordinate (2,3). This naming scheme
simplifies access to an individual structure (just provide the
coordinates) and also simplifies reporting of the full path using
HDS nomenclature: to convert the HDF5 path of the
structure ROOT/HISTORY/ARRAY_OF_STRUCTURES_CELL(3) to the
HDS path just requires the removal of the fixed cell prefix to
convert it to ROOT.HISTORY(3)^. The long structure name is
hidden by the HDS library and only visible when the file is
accessed using HDF5 tools. When an array of structures has
been created the dimensionality is stored in an attribute named
HDS_STRUCTURE_DIMS. In the future we will consider
implementing structure arrays using the HDF5 feature allowing
references to arbitrary HDF5 objects to be stored in a dataset; this
would have the advantage of reducing the structure complexity
and would simplify cell access.
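The naming scheme and the path conversion can be sketched as follows, using the prefix and the example paths quoted above. The function names are hypothetical, and the sketch assumes that the prefix never occurs as the start of an ordinary component name, which holds if it is longer than any valid HDS component name.

```python
CELL_PREFIX = "ARRAY_OF_STRUCTURES_CELL"


def cell_group_name(coords):
    """Build the HDF5 group name for the structure at the given coordinates."""
    return "%s(%s)" % (CELL_PREFIX, ",".join(str(c) for c in coords))


def hdf5_path_to_hds(path):
    """Convert an HDF5 path to an HDS path by dropping the fixed cell prefix."""
    hds_path = ""
    for part in path.split("/"):
        if not part:
            continue  # ignore leading or doubled separators
        if part.startswith(CELL_PREFIX):
            hds_path += part[len(CELL_PREFIX):]  # keep only "(i,j,...)"
        elif hds_path:
            hds_path += "." + part
        else:
            hds_path = part
    return hds_path
```

With these definitions, `cell_group_name((2, 3))` gives `ARRAY_OF_STRUCTURES_CELL(2,3)`, and `hdf5_path_to_hds("ROOT/HISTORY/ARRAY_OF_STRUCTURES_CELL(3)")` gives `ROOT.HISTORY(3)`.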

Finally, the data type of a structure is not a fundamental part
of HDF5 so this information is stored in an attribute with name
CLASS following the convention used in other HDF5 data
models such as the Image and Palette classes.

Fig. 2 shows a comparison of the HDS and HDF5 view of
the same data file. These traces show that the mapping from
HDS structure/primitive to HDF5 group/dataset is being
followed with three attributes added to provide the metadata
required by HDS.

5.5. Primary locators 

In HDF5 the file is kept open until all identifiers associated
with a file are closed. HDS distinguishes primary identifiers
from secondary identifiers such that a file is closed when the
count of active primary locators reaches zero, even if some
active secondary locators remain. To implement this in HDSv5 it
is necessary to store every locator that is allocated in a global
data structure. We use the uthash macros (Hanson, 2014) to
implement a hash table indexed by the hid_t HDF5 file
identifier. Each file identifier key then maps to a utarray dynamic
array containing the locators. The individual locators have a


^HDS uses dot separators rather than directory separators when specifying
a path within a data file. This will be familiar to VMS users.
(a)

IMAGE <NDF>
   DATA_ARRAY <ARRAY> {structure}
      DATA(2) <_INTEGER> 1,2

(b)

GROUP "/" {
   ATTRIBUTE "CLASS" {
      DATATYPE H5T_STRING {}
      DATASPACE SCALAR
      DATA {
      (0): "NDF"
      }
   }
   ATTRIBUTE "HDS_ROOT_NAME" {
      DATATYPE H5T_STRING {}
      DATASPACE SCALAR
      DATA {
      (0): "IMAGE"
      }
   }
   GROUP "DATA_ARRAY" {
      ATTRIBUTE "CLASS" {
         DATATYPE H5T_STRING {}
         DATASPACE SCALAR
         DATA {
         (0): "ARRAY"
         }
      }
      DATASET "DATA" {
         DATATYPE H5T_STD_I32LE
         DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
         DATA {
         (0): 1, 2
         }
      }
   }
}


Figure 2: Comparison of the HDS view of an HDF5 data file with the HDF5
view. (a) Listing of a HDSv5 file using the standard HDS tracing tool, hdstrace
(Currie, 2014). (b) Listing of the HDF5 file corresponding to the HDS
structures. The definition of the H5T_STRING datatype has been elided for clarity.
The listing was made using the standard h5dump command.

flag indicating whether they are primary or secondary. The
uthash macros were chosen since they did not require an
additional library, they used a BSD license that was compatible
with the HDSv5 library license, and the programming interface
was reasonably straightforward.

When each locator is annulled these data structures are
scanned to check whether this locator was the final primary
locator. If it is, all remaining locators are themselves annulled.
One complication is that the file identifier returned will be
different for each call to H5Fopen so it is important to determine
which file identifiers are associated with the same file.
Without accounting for this, critical locators may be annulled at the
wrong time. Rather than attempt to guess what the HDF5
library has chosen to do by normalising supplied filenames, the
virtual file driver layer is queried to obtain the Unix file
descriptor. All file identifiers with a shared file descriptor are queried
before deciding whether a file should be closed.

One consequence of this behind-the-scenes freeing of
resources is that it is possible for a library user that does not
understand the distinction between primary and secondary
locators to be left with pointers to structures that have been freed.
To prevent unfortunate crashes, when a locator is freed
automatically the contents of the structure are reset but the structure
itself is not freed. This does result in a small memory leak but
is thought to be more acceptable than a core dump.
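The bookkeeping can be sketched with a dictionary in place of the uthash table and a list in place of the utarray. This is a simplified Python illustration with hypothetical names: `file_key` stands for whatever identifies the underlying file (the text above uses the Unix file descriptor), and "resetting" a locator is reduced to clearing an `active` flag.

```python
class Locator:
    """Stand-in for a locator with a primary/secondary flag."""

    def __init__(self, file_key, primary):
        self.file_key = file_key
        self.primary = primary
        self.active = True


class LocatorRegistry:
    """Tracks every allocated locator, grouped by the file it refers to."""

    def __init__(self):
        self._by_file = {}

    def register(self, locator):
        self._by_file.setdefault(locator.file_key, []).append(locator)

    def annul(self, locator):
        """Annul a locator; return True if this caused the file to be closed."""
        locators = self._by_file.get(locator.file_key, [])
        if locator in locators:
            locators.remove(locator)
        locator.active = False
        if any(other.primary for other in locators):
            return False  # at least one primary locator is still active
        for other in locators:
            other.active = False  # reset the contents, keep the object itself
        self._by_file.pop(locator.file_key, None)
        return True
```

In this sketch a caller still holding a secondary locator after the file has been closed finds an inactive object rather than a dangling reference, mirroring the reset-but-do-not-free choice described above.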

5.6. Locator groups 

Locator groups are not a feature of HDF5 and were
implemented natively in the HDSv5 library. The implementation is
similar to the primary locator system described previously
except that the key for the uthash mapping table is the group
name rather than the file identifier. When a group is flushed, all
locators in the group are annulled and the group is deleted from
the hash table.
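A corresponding sketch for locator groups keys the table on the group name. As before this is an illustration in Python with hypothetical names, and annulling is reduced to clearing a flag.

```python
class Locator:
    """Stand-in for a locator that can be annulled."""

    def __init__(self, name):
        self.name = name
        self.active = True


class LocatorGroups:
    """Maps a group name to the locators that have been linked to it."""

    def __init__(self):
        self._groups = {}

    def link(self, group_name, locator):
        self._groups.setdefault(group_name, []).append(locator)

    def flush(self, group_name):
        """Annul every locator in the group and delete the group itself."""
        locators = self._groups.pop(group_name, [])
        for locator in locators:
            locator.active = False
        return len(locators)
```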

5.7. Array slicing 

A very powerful feature of HDF5 is the concept of a
dataspace. A dataspace determines the rank and dimensions of a
dataset and is used to specify the size of the HDS primitives.
When a slice or cell request is made a single hyperslab
selection is made which adjusts the external view of the dataset.
HDS slices and cells are much simpler than what is possible
in a hyperslab selection, and are restricted to simple subsets of
a region.

When a locator is vectorized the dataspace associated with 
the locator is reshaped to be 1-dimensional. Subsequent slices 
of that vectorized dataspace are then handled in the same way 
as before using a hyperslab. 
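The translation from HDS slice bounds to a simple hyperslab can be sketched as below, assuming 1-based inclusive bounds in Fortran order on the HDS side and a 0-based start plus a count in C order on the HDF5 side. The function names are hypothetical and the bounds checking is deliberately minimal.

```python
def slice_to_hyperslab(lower, upper):
    """Convert 1-based inclusive Fortran-order bounds to (start, count) in C order."""
    if len(lower) != len(upper):
        raise ValueError("lower and upper bounds must have the same rank")
    start = []
    count = []
    # Reverse the dimension order to go from Fortran order to C order.
    for lo, hi in zip(reversed(lower), reversed(upper)):
        if lo < 1 or hi < lo:
            raise ValueError("invalid bounds: %d:%d" % (lo, hi))
        start.append(lo - 1)        # 1-based to 0-based
        count.append(hi - lo + 1)   # inclusive upper bound
    return tuple(start), tuple(count)


def cell_to_hyperslab(coords):
    """A cell is a slice whose lower and upper bounds coincide."""
    return slice_to_hyperslab(coords, coords)
```

For example, the HDS slice (1:3, 2:5) maps to start (1, 0) and count (4, 3), and the cell (2, 3) maps to start (2, 1) with a count of 1 in each dimension.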

5.8. Type conversion 

HDF5 supports an extremely broad range of data types and
automatic conversion of numerical types when storing or
retrieving a dataset. Critically, HDF5 does not support type
conversion of string and logical types to numeric types (and vice
versa) so this facility has been added explicitly in the HDSv5
library to maintain compatibility. This is simplified by HDS
having the concept of a “bad” or “magic” value for each datatype
that can be used to indicate where a conversion was not
possible.
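The role of the bad value in string conversions can be sketched as follows. The sentinel used here (the most negative 32-bit integer) and the function names are assumptions made for this illustration, and the parsing rules are those of Python's `int`, which need not match the rules applied by HDS.

```python
BAD_INTEGER = -(2**31)  # assumed sentinel for an _INTEGER bad value


def string_to_integer(text):
    """Convert a space-padded string to an integer, or the bad value on failure."""
    try:
        return int(text.strip())
    except ValueError:
        return BAD_INTEGER


def integer_to_string(value, length):
    """Convert an integer to a fixed-length, space-padded string."""
    text = str(value)
    if len(text) > length:
        raise ValueError("value does not fit in %d characters" % length)
    return text.ljust(length)
```

Returning a sentinel rather than raising an error allows an entire array to be converted in one pass, with the failed elements flagged individually.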

5.9. Memory mapping 

An important requirement for any HDS implementation is to
support direct memory mapping of files for both read and write
operations. This has worked well over the years and helps
minimize resource requirements. HDF5 has other priorities and
advocates chunked access to minimize resources rather than
providing direct access to the bytes on disk. The ability to split
a dataset into multiple chunks and to insert arbitrary
compression filters and virtual file drivers between the bytes trumps any
perceived advantages of memory mapping. In the HDSv5
implementation memory mapping is only attempted if files are in
read only mode, if HDF5 will return the byte offset to the start
of the dataset, if the HDF5 type system indicates that the
in-memory data type and the on-disk data types are compatible,
and if the virtual file driver will provide a file descriptor. In
all other cases memory is allocated using standard system calls
when a user requests memory mapping, and the data are written
back to the file when the data are “unmapped”. Mapping can
also be disabled using a tuning parameter.
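The conditions listed above amount to a single predicate. The sketch below expresses it in Python; the `MapRequest` fields and the strategy labels are hypothetical names chosen for this illustration and do not correspond to identifiers in the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MapRequest:
    """The facts needed to decide whether a dataset can be memory mapped."""

    read_only: bool
    dataset_offset: Optional[int]   # byte offset, or None if HDF5 cannot supply one
    types_compatible: bool          # in-memory and on-disk types agree
    file_descriptor: Optional[int]  # None if the file driver cannot supply one
    mapping_enabled: bool = True    # tuning parameter


def can_memory_map(request):
    """True only if every condition for direct mapping is satisfied."""
    return (
        request.mapping_enabled
        and request.read_only
        and request.dataset_offset is not None
        and request.types_compatible
        and request.file_descriptor is not None
    )


def choose_strategy(request):
    """Fall back to allocating memory and copying when mapping is not possible."""
    return "mmap" if can_memory_map(request) else "allocate-and-copy"
```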

As a test a 4 GB dataset was loaded into the GAIA
visualization tool (Draper et al., 2009, ascl:1403.024). With
memory mapping enabled the image displayed within two seconds
and the process only used a few tens of megabytes of
memory. With memory mapping disabled it took about ten seconds
to load the image and the process took 5 GB of memory.

The ability to memory map at all requires that datasets are 
created in single chunks and are not resizable. This causes some 
problems with HDS which assumes that all primitive objects 
are resizable. The HDSv5 dataset resizing function therefore 
attempts to use the native HDF5 resizing function but is usually 
forced to create a new dataset and copy the contents from the 
existing dataset, before deleting the original and renaming the 
new dataset. This can result in significant unused space in an
HDF5 file. 
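The effect of the copy-based resize on file size can be illustrated with a toy model. The sketch below is not HDF5: `ToyFile` is a hypothetical container that simply assumes space freed by deleting a dataset is never returned, and the `_TMP` suffix is an invented temporary name. It shows only why the create, copy, delete and rename sequence leaves unused space behind.

```python
class ToyFile:
    """Toy container in which deleting a dataset never returns its space."""

    def __init__(self):
        self.datasets = {}   # name -> [values, resizable flag]
        self.allocated = 0   # elements of storage consumed so far

    def create(self, name, values, resizable=False):
        self.datasets[name] = [list(values), resizable]
        self.allocated += len(values)

    def live(self):
        """Elements still reachable through a dataset name."""
        return sum(len(values) for values, _ in self.datasets.values())

    def unused(self):
        """Elements of storage that no dataset refers to any more."""
        return self.allocated - self.live()


def resize(toy, name, new_size, fill=0):
    """Resize a dataset, natively if possible, otherwise by copying."""
    values, resizable = toy.datasets[name]
    if resizable:
        # Native route: extend or truncate the existing dataset in place.
        grown = max(0, new_size - len(values))
        values[:] = values[:new_size] + [fill] * grown
        toy.allocated += grown
        return
    # Copy route: new dataset, copy contents, delete original, rename.
    temporary = name + "_TMP"                       # hypothetical temp name
    kept = values[:new_size]
    toy.create(temporary, kept + [fill] * (new_size - len(kept)))
    del toy.datasets[name]                          # space is not reclaimed
    toy.datasets[name] = toy.datasets.pop(temporary)
```

Growing a four-element fixed-size dataset to six elements in this model consumes ten elements of storage for six elements of live data, which is the kind of overhead a file that is resized repeatedly would accumulate.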

6. Implementation Issues 

The prototype library has largely shown that replacing native
HDS with an HDF5 implementation is feasible. Unfortunately
we have found some incompatibilities that have
required minor code changes to Starlink applications. So far
these have been restricted to applications that open an input file
with read access and then open an output file with read/write
access. If the input file and output file are the same file (for
example when copying a structure within a file), HDS had no
issue with this, but in HDF5 this is strictly forbidden due to the
internal tracking of open files. The changes to Figaro (Shortridge,
1993, ascl:1411.022) and Hdstools (Chipperfield, 2002)
result in the input being opened to validate it, with the
full path to the requested object being recorded. The input is then
closed and the output re-opened in read/write mode. Once this happens
the input file can be re-opened and the application can continue
as before. The modification also works with HDSv4 so can be
adopted at the expense of some more convoluted code.
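The reordering used in these applications can be sketched with a toy model of open-file tracking. Everything here is hypothetical: `ToyRegistry` is not the HDF5 or HDS API, and its rule (a read/write open is refused while the same file is open read-only, but a read open of a file already open for read/write is allowed) is an assumption chosen to mirror the behaviour described in the text.

```python
class ToyRegistry:
    """Toy stand-in for a library that tracks open files by path."""

    def __init__(self):
        self.open_modes = {}   # path -> list of modes currently open

    def open(self, path, mode):
        modes = self.open_modes.setdefault(path, [])
        if mode == "r+" and modes and "r+" not in modes:
            # Assumed rule: no read/write open while open read-only.
            raise RuntimeError(f"{path} is already open read-only")
        modes.append(mode)
        return (path, mode)

    def close(self, handle):
        path, mode = handle
        self.open_modes[path].remove(mode)


def naive_open(registry, input_path, output_path):
    """Original ordering: fails when input and output are the same file."""
    source = registry.open(input_path, "r")
    output = registry.open(output_path, "r+")
    return source, output


def open_input_then_output(registry, input_path, object_path, output_path):
    """Reordered sequence following the steps described in the text."""
    probe = registry.open(input_path, "r")      # 1. open input to validate it
    requested = (input_path, object_path)       # 2. record the full object path
    registry.close(probe)                       # 3. close the input
    output = registry.open(output_path, "r+")   # 4. open output read/write
    source = registry.open(input_path, "r")     # 5. re-open the input
    return source, output, requested
```

With the same path given for input and output, `naive_open` raises in this model while `open_input_then_output` succeeds, at the cost of the extra bookkeeping the text refers to.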

When designing the mapping of HDS to HDF5 some care
was taken to not deliberately restrict the ability of the HDSv5
library to read HDF5 files that were not created by the library.
To that end, attributes were chosen that were either already in
common usage, e.g., CLASS, or were chosen such that the absence
of the attribute would result in reasonable behavior (root
naming and structure dimensions). However, the implementation
cannot work miracles in dealing with the mismatch between
the HDS and HDF5 data models. In particular, the HDS
data model has no concept of attributes in the sense that HDF5
has them. Figure 3 shows the output of an HDS tracing program
on a file created from a FITS file as described by Price
et al. (2015). HDS is able to read some of the file contents but
fails to read groups with names that exceed 16 characters. This
limit can be increased by recompiling all Starlink applications
but HDS relies on this limit being fixed at compile time. A
more complex solution would be for HDS to return a shortened
form of the name to the HDS API, possibly keeping track of
the mapping from long name to short name internally. It is
currently unclear how important it will be to handle this situation.
What is not obvious from this trace is that the HEADER structure
is not empty; all the FITS headers are stored as attributes.

    HDF5ROOT  <HDFITS>

       PRIMARY  <HDU>  {structure}
          DATA(35,35)  <_WORD>  5419,5419,5332,6025,
                                ... 6659,6659,6572
          HEADER  <HDF5NATIVEGROUP>  {structure}
             {structure is empty}

    !! Invalid name string 'Photometric CALTABLE'
    !  specified; more than 16 characters long

Figure 3: Output from the HDS structure tracing application, hdstrace, on an
HDF5 file created without using the HDS library. HDS has read the CLASS
attributes when available and replaced optional attribute values with placeholders
for the name of the root group and the type of the HEADER structure.
Unfortunately a group in the file has a name that exceeds the HDS fixed-width limit,
preventing HDS from accessing it. Furthermore, the HEADER structure is not in
fact empty as all the FITS headers are actually stored as HDF5 attributes and
these are invisible to the HDS data model.
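The idea of returning a shortened name while tracking the long-to-short mapping internally could look something like the following. This is a sketch of one possible scheme, not something HDSv5 implements: the class `NameShortener`, the `~N` suffix convention and the constant `HDS_NAME_LIMIT` are all invented for illustration, and HDS's other naming rules (such as case handling) are ignored.

```python
HDS_NAME_LIMIT = 16  # limit quoted in the text; assumed here for illustration


class NameShortener:
    """Map arbitrary-length names to unique names within a fixed width."""

    def __init__(self, limit=HDS_NAME_LIMIT):
        self.limit = limit
        self.long_to_short = {}
        self.short_to_long = {}

    def shorten(self, long_name):
        """Return a stable short form, creating one on first use."""
        if long_name in self.long_to_short:
            return self.long_to_short[long_name]
        candidate = long_name[:self.limit]
        counter = 0
        # On a clash, replace the tail with a numbered suffix until unique.
        while candidate in self.short_to_long:
            counter += 1
            suffix = f"~{counter}"
            candidate = long_name[:self.limit - len(suffix)] + suffix
        self.long_to_short[long_name] = candidate
        self.short_to_long[candidate] = long_name
        return candidate

    def expand(self, short_name):
        """Recover the original name so the real group can be opened."""
        return self.short_to_long[short_name]
```

Under this scheme the group name from Figure 3 would be presented to the HDS API as its first 16 characters, and names that fit within the limit pass through unchanged.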

7. Metrics 

When considering adoption of a new format it is important to 
consider any performance differences and whether the files use
up differing amounts of storage. These tests used HDF5 version 
1.8.13 and a late 2014 version of HDSv4. 

7.1. File Sizes 

Test datasets were generated comparing the new format file
sizes with the original file sizes.^ A comparison is shown in
Table 2. The files were generated as follows:

1. The AGI graphics database generated by the SpecDRE
   demonstration script (Meyerdierks, 1992, ascl:1407.003).
   The graphics database makes extensive use of arrays of
   structures and resizing of elements. The HDF5 variant is
   more than five times larger than the HDSv4 variant with
   20% of that accounted for by empty space.

2. An ADAM parameter file generated from the execution
   of the CCDBIG (Taylor, 1998) exercise script. 34% of the
   HDF5 file is empty space. Like the graphics database file,
   this file is updated constantly during program execution.

3. A SCUBA-2 acquisition file (Bintley et al., 2014), which
   contains lots of data as well as table structures and is
   written in a single operation.

4. KAPPA (Currie and Berry, 2013, ascl:1403.022) logo image
   consisting of a simple NDF with WCS and FITS header.

^All these tests were done using the default file access property list settings.
Selecting the latest format, via H5Pset_libver_bounds, results in slightly
smaller files for three of the four tests but a larger file in the parameter file test.
It has not yet been decided whether HDSv5 should adopt maximal backwards
compatibility for files or always be on the cutting edge.

Table 2: A comparison of file sizes resulting from identical operations where solely the file format is changed. Sizes are given for the native files and gzipped
versions. Also included are the sizes from using HDF5 native SHUFFLE/GZIP compression. The first two rows are from files that are continuously updated during
processing; the remaining rows are from statically created data sets. All file sizes are in bytes.

File type        HDSv4        HDSv5        v5/v4   HDSv4 (gz)   HDSv5 (gz)   v5(gz)/v4(gz)   HDF5 comp.
AGI database     35 328       191 664      5.43    4 473        9 689        2.17            403 608
Parameter file   5 632        18 600       3.30    745          1 318        1.77            19 800
SCUBA-2          18 712 576   18 796 530   1.00    15 530 536   15 537 267   1.00            14 660 769
KAPPA logo       406 016      411 272      1.01    30 633       31 123       1.02            53 827

The numbers indicate that for small files with many structures
the HDSv5 files are significantly larger. Some of this
may be due to the inability to resize datasets without deleting
them, but even if the files are repacked they are still larger than
the HDSv4 versions. For larger data files the situation is less
clear-cut: with the scientific data dominating the file contents, the
overhead from HDF5 is much lower. The advantages of HDF5 become
obvious once native data compression is used, with the
SCUBA-2 example file becoming 6% smaller than even the
gzipped version of the HDSv4 format.
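The ratios in Table 2 and the quoted 6% figure can be checked directly from the tabulated byte counts. The short script below is only a convenience for that arithmetic; the dictionary keys and function names are chosen here for illustration.

```python
# File sizes in bytes, copied from Table 2.
SIZES = {
    "AGI database": {"v4": 35328, "v5": 191664,
                     "v4_gz": 4473, "v5_gz": 9689, "hdf5_comp": 403608},
    "Parameter file": {"v4": 5632, "v5": 18600,
                       "v4_gz": 745, "v5_gz": 1318, "hdf5_comp": 19800},
    "SCUBA-2": {"v4": 18712576, "v5": 18796530,
                "v4_gz": 15530536, "v5_gz": 15537267, "hdf5_comp": 14660769},
    "KAPPA logo": {"v4": 406016, "v5": 411272,
                   "v4_gz": 30633, "v5_gz": 31123, "hdf5_comp": 53827},
}


def size_ratios(entry):
    """Return the (v5/v4, v5(gz)/v4(gz)) ratios rounded as in the table."""
    return (round(entry["v5"] / entry["v4"], 2),
            round(entry["v5_gz"] / entry["v4_gz"], 2))


def native_compression_saving(entry):
    """Fraction by which the HDF5-compressed file undercuts gzipped HDSv4."""
    return 1.0 - entry["hdf5_comp"] / entry["v4_gz"]


for name, entry in SIZES.items():
    plain, gzipped = size_ratios(entry)
    saving = 100.0 * native_compression_saving(entry)
    print(f"{name:15s} v5/v4={plain:.2f} gz={gzipped:.2f} saving={saving:.1f}%")
```

For the SCUBA-2 file the saving evaluates to about 5.6%, consistent with the rounded 6% in the text. For the other three files the tabulated HDF5-compressed size exceeds the gzipped HDSv4 size, so the saving is negative there.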

7.2. Benchmarks 

The library has been tested on a number of standard Starlink
benchmark routines from CCDPACK (Draper et al., 2011,
ascl:1403.021), CCDBIG, and starbench (Rankin et al., 2003). An
example data reduction test was also executed using SCUBA-2
data and the ORAC-DR pipeline (Jenness and Economou, 2015,
ascl:1310.001)®. The results are shown in Table 3 and indicate
that for tasks using lots of small files with lots of I/O, HDSv4 is
much faster. As the tests begin to use more real-world processing
tasks with larger datasets the difference disappears and both
libraries perform to within a few per cent. The final test involving
ORAC-DR indicates that there may be a small performance
advantage to not using memory mapping and this result may
inform later decisions on whether to switch to using resizable
chunked datasets in the future.
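To judge which of the differences in Table 3 are significant, the timing ratios can be given an uncertainty. The sketch below assumes that the second number quoted with each time is an independent one-sigma uncertainty; the text does not say how those values were derived, so this is an assumption, as is the use of standard first-order error propagation for a quotient.

```python
import math

# (HDSv4 time, HDSv4 spread, HDSv5 time, HDSv5 spread) in seconds, Table 3.
BENCHMARKS = {
    "starbench/specdre": (0.82, 0.02, 1.25, 0.01),
    "starbench/kappa": (8.58, 0.46, 9.88, 0.09),
    "CCDPACK": (5.02, 0.08, 5.79, 0.02),
    "CCDBIG": (55.42, 0.59, 54.04, 0.07),
    "CCDBIG (*)": (56.08, 0.28, 54.64, 0.19),
    "ORAC-DR": (486.0, 5.0, 482.0, 7.0),
    "ORAC-DR (*)": (450.0, 2.0, 465.0, 3.0),
}


def ratio_with_error(t4, s4, t5, s5):
    """Return v5/v4 and its uncertainty, adding relative errors in quadrature."""
    ratio = t5 / t4
    error = ratio * math.hypot(s4 / t4, s5 / t5)
    return ratio, error


def unmapped_saving(mapped_time, unmapped_time):
    """Fraction of run time saved by disabling memory mapping."""
    return 1.0 - unmapped_time / mapped_time


RATIOS = {name: ratio_with_error(*row) for name, row in BENCHMARKS.items()}
ORAC_SAVING_V4 = unmapped_saving(486.0, 450.0)   # about 0.074
ORAC_SAVING_V5 = unmapped_saving(482.0, 465.0)   # about 0.035

for name, (ratio, error) in RATIOS.items():
    print(f"{name:18s} v5/v4 = {ratio:.2f} +/- {error:.2f}")
```

Under these assumptions the ORAC-DR ratio with memory mapping enabled is consistent with unity, and both libraries run the ORAC-DR test a few per cent faster with memory mapping disabled, in line with the tentative conclusion drawn in the text.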

®These were observations 28, 31, 35, 44 and 51 from 2012 June 11th,
reduced using the JCMT Science Archive public processing recipe (Bell et al.,
2014).

Table 3: Various Starlink benchmarking tests. All times are in seconds. The
items with an asterisk indicate the benchmark was run with memory mapping
disabled.

Test                HDSv4           HDSv5           v5/v4
starbench/specdre   0.82 ± 0.02     1.25 ± 0.01     1.52
starbench/kappa     8.58 ± 0.46     9.88 ± 0.09     1.15
CCDPACK             5.02 ± 0.08     5.79 ± 0.02     1.15
CCDBIG              55.42 ± 0.59    54.04 ± 0.07    0.98
CCDBIG (*)          56.08 ± 0.28    54.64 ± 0.19    0.97
ORAC-DR             486 ± 5         482 ± 7         0.99
ORAC-DR (*)         450 ± 2         465 ± 3         1.03

8. Conclusion

A new HDF5-based implementation of the HDS programming
interface has been written which allows the Starlink software
collection to be moved to a more widely-used file format.
All but a handful of the approximately 150 HDS API functions
have been implemented, with the remaining few being the
routines that query low level implementation details. The HDSv5
library consists of approximately 10 000 lines of C with another
5 000 lines of C for the implementation of the wrapper (many
of those lines are generated automatically). For comparison, the
HDSv4 library consisted of about 18 000 lines of C and HDF5
itself consists of about 120 000 lines of C.

It has been shown that the library performs as well as the 
HDSv4 implementation in most tests involving reasonably- 
sized datasets and opens up the possibility for Starlink data 
products to be more easily consumed by others without requir¬ 
ing a format conversion. A native Python interface to the HDS 
library does exist but it is far easier to convince prospective 
consumers of the data files to use something such as h5py (e.g., 
Collette, 2013) to read the data, albeit with a different view of 
the data models. Furthermore, these files would be readable by 
general HDF5 visualization tools. 

The Starlink open-source community must now decide 
whether to pursue this work and integrate it into the Starlink 
software distribution. It is possible that the project will decide 
to stick with HDSv4 and attempt to update the library to support 
64-bit dimension sizes. This is a reasonable course of action to 
take, with an uncertain effort requirement, although it does not 
solve the issues relating to lack of documentation and sociolog¬ 
ical barrier to adoption of the Starlink software. Furthermore, if 
the new implementation is adopted, serious consideration must 
be made as to whether the approximately 4 million HDS files in 
the JCMT Science Archive (Economou et al., 2015) should be 
converted to HDF5. There is a risk involved for the archive
in terms of the cost of keeping the old versions around and 
whether the conversion has been done correctly. The benefit 
will be that the raw data archive will immediately become more 
accessible to the general astronomer. 

This work also provides an alternative approach to porting
FITS files to HDF5 format (see e.g., Price et al., 2015, for other 
options). The Starlink convert package (Currie, 1997; Currie 
et al., 1996) has received significant development effort over the 
years to map FITS to a hierarchical data model. It will be in¬ 
teresting to see whether the community can agree on a standard 
model for FITS to HDF5 conversion. 

Now that a functioning prototype exists and has been proven 
to work acceptably, we must consider the possibility of expand¬ 
ing the HDS API to take advantage of HDF5 features. In par¬ 
ticular compound datatypes provide the prospect of native table 
access (the import of FITS binary tables would benefit signifi¬ 
cantly from this), an updated slicing API could provide access 
to hyperslabs, and the ability to specify chunking size and the 
maximum expected size of a dataset could result in significant 
efficiency benefits, albeit at the expense of memory mapping. 
It may be possible to consider allowing the HDS and HDF5 
APIs to be used simultaneously on a single file. This has many 
attractions and provides a simple path to enhancing native ap¬ 
plications. It also would mean it would be impossible to switch 
HDS from HDF5 to another format in the future. If the Ad¬ 
vanced Scientific Data Format (ASDF; Droettboom and Bray, 
2014; Greenfield et al., 2015) were to suddenly become popular 
in astronomy it would be conceivable to investigate a port of the 
HDS API to ASDF. If HDF5 identifiers had been used natively 
in the code this would be a significantly more complicated task. 
This is somewhat similar to the problems that are faced in port¬ 
ing NDF to other formats. The NDF standard (Currie et al., 
1988) was specifically designed with an “airlock” API that al¬ 
lowed the user to obtain an HDS locator to extensions. This 
flexibility was important in early adoption and provided an easy 
way for extensions to be implemented. It also meant that any 
attempt to switch NDF absolutely required that the HDS API 
was itself ported, otherwise all the extensions in use would be 
unreadable. Indeed one key motivation for this work is that it 
brings NDF along to HDF5 without any NDF code or applica¬ 
tions that use extensions having to be modified. 